Mm-hmm. Actually uh you can t perhaps just uh just Th there is one frequency that is the base from for uh Okay. Both. Mm the pitch and uh and the freque Okay, s and you can measure measure that? Or Okay. And i is it uh expensive to do, But in fact if y you speak of uh a lot of things that are not expensive, at the end you have uh something that is. Hmm. So you have to do F_F_T_, because otherwise Mm-hmm. Yes. Yes, but it I it it is im It i it yes, it is important for detecting patterns. Because uh if you say uh let's say uh w how much the person um have spoken over the the five last minutes, uh Uh if someone come and yes. Yes, yes. Y This is uh mm. Okay. So i uh this is only based on uh you give uh the spectrum and uh uh okay. Mm-hmm. Yes. But uh the last ones are always the Mm not really. Mm yes. Hmm uh I I said uh we want t to know to um to do separation just because of the voice recognition. I mm this is not the yes. Yes, no, but but in fact we have uh yes, we are designing uh a noise sensitive table just f from your ring, but we are also designing uh a prototype um to analyse the conversations. So uh we might add there a lot of uh features. Okay. And uh I don't know, perhaps uh perhaps if you have two people speaking uh speaking uh at the same time, uh I don't know if uh voice recognition of an uh such a stream is uh real. Okay. Yes, so that's fine. Mm-hmm. Sure. Yes, it's not um every important feature for now, but Mm-hmm. Yes. And uh uh another what uh. Um I don't remember. Yes, also uh to to extract the context. Mm because if you know the words that are spoken, you can let's say uh someone say uh he's angry uh about the word he's seen. Yes. You mi you might want to add uh this kind of this is not in uh my project, but Yes. Um Yes. And have a special events trigger if Sure. Sure. No, it's it doesn't sound simple, but yes, you can yes. Sure. Yes, in fact uh for now we are using uh a kind of calibration, but uh it's still um Yes. 
Yes, um uh but in fact you have to sit in front of the microphone, and it's uh yeah, people one speak and people two and people three and you are just uh subtracting the the level that you you kept in the other microphones. Okay. Okay. Mm-hmm. Mm-hmm. Okay, just uh a last question, um when you mentioned it. Um if I'm speaking very uh, how do you say it, um yes, I'm just speaking one word every uh every one second. Yes a. Uh is the the event that is detected is um will be uh a sum of uh small parts? There won't be uh okay. So you yes, you have to to reprocess to uh to say let's say yes. There is a very small um interval. You have to consider it uh. Because this is yes, this is one this can be considered as one interv intervention. It might be. This is not a very uh easy problem. I'm done. Sure, yes. And you can code Java. You can choose Matlab. Yes, sure. You have to code it uh directly to the machine. Hmm. My sheet is very uh filled up. I will take yours.
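[Editorial sketch] The point raised above, that someone speaking one word every second produces many small detected events which have to be reprocessed into one intervention, can be illustrated with a minimal gap-merging pass. This is only a sketch under assumptions: the function name `merge_events`, the `(start, end)` tuples in seconds and the `max_gap` value of one second are hypothetical and do not come from the meeting.

```python
def merge_events(events, max_gap=1.0):
    """Merge (start, end) speech events separated by at most max_gap seconds.

    Assumption: events are detected speech intervals in seconds; max_gap is
    an illustrative tuning value, not one given in the discussion.
    """
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] <= max_gap:
            # The silence is short: treat it as part of the same intervention.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # The silence is long: a new intervention begins here.
            merged.append((start, end))
    return merged


# Three short words close together, then a later isolated one.
words = [(0.0, 0.3), (1.0, 1.4), (2.1, 2.4), (6.0, 6.5)]
interventions = merge_events(words, max_gap=1.0)
print(interventions)
```

How large `max_gap` should be is exactly the "not very easy" part mentioned in the discussion; the sketch leaves it as a parameter.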
Okay. Mm-hmm. Or actually look at the pitch, no? Don't ask me how it's done I I know the theory, so, yeah. Isn't it the pitch and the frequency? The base frequency of the And if you have both you can um well, you can compare this to to one stored and so get out uh which person it is. Pardon? And uh well, the things that make the sou the voice. Yeah. And and the the holes in your head, everything, the the nose and so on, and that's that's really uh unique for ev well almost unique for every person, so. Well once you get this information, your frequency and uh your vocal contract or how it's called, uh y you can use this information to compare to to to the frames here. No, but No no no no. That that that everything is done through through the voice here, you take a voice of someone, and then you can get out this information, it's it's not p it's clear it's not perfect, but s it's enough to to to use it for for different purposes and uh Um not not that, well uh as far as I know. Actual Is it called P_C_M_? I thi um well it's it's not a technology s yeah. It's more or less used this to compress audio. I have to I want to check it, I can show it you afterwards. well I did a course about it, but uh I shouldn't say it too loud, because um the professor is working here at IDIAP. I should reread what I signed there so. Well uh it I'm even not sure if it's that important to know which person is talking, because uh if uh if it I think about a table, or so something uh physical, uh you just want to know in in which corner of the table uh speech is coming and not uh which person produced the speech, so. Yeah, but it's not really related to the ah, yeah, I see it's uh you need a bit both uh Sound. You might might want to do. You need to separate it. No, but I think once you have the voice for a person, you can start to do this and uh once uh people have speech recognition or uh far enough to t Mm uh. I think so. The link. Mm. Okay. No, I think does this work, email? 
Does it work if I put in my email and uh No, if it's not uh I guess you would have to install it and so on. Uh no no no, you see uh I thought you can put in here your email address and and just uh do this and then it mails me. Yeah, you see I want too much. And after all Matlab is also programming, so it Mm-hmm. So. Thank you for the information.
Okay. Yes. So you were talking about the two first question? Okay. And a little bit about same people moving, with the clustering. Uh-huh. So this one. What? I'm sorry, I didn't really get the point of comparing the pitch the pitch is the is the frequency that is okay. And the way you transform it uh v by the mouth? You were comparing what with what? Uh-huh, okay, okay. So uh And you were talking about comparing what with what? Okay. But you want to have the information because the the mouth, you move it. To scan t all the user. Okay. Okay, so you have the p the pitch and you have information about the voice dynamic or Mm mm mm. Emotion, the involvement in a discussion. If you are saying oh yeah yeah we should Okay. In C_P_U_ you mean? Not really familiar. Hmm. Okay, so you are the expert. Careful, you are recorded. Okay. Okay, so all this stuff is done from the F_F_T_. Also location to identity. Uh Mm-hmm. Okay. Mm-hmm. But to get a cluster the location that is uh a place where uh noise is regularly coming from. Mm-hmm. Mm mm mm. But for instance for a setting like uh here it would be enough, because we are not moving. Okay, right. But we are uh targeting a place where we don't know i where it will be more flexible, so we need more than just uh location if we want to uh Mm-mm. I agree. Maybe just I think it may it may be okay, but it depends how long Hmm. From th from the location. Oh. Mm-hmm. When are you planning to finish your Okay. More extensible. Mm. I don't know how mu maybe we would be more interested to know that it's not the same people, more than knowing that it's the same people. Or to detect that someone is more that was not talking before is talking now. Yes. Mm. Mm. Okay. That are further directions. Professor was interested to know when people were talking about him. Ah, someone is talking about me, let's listen. And lighting a on his office. Okay, so are you done w with what you wanted to presented us uh in regards with these questions? Mm mm. 
three papers about? Mm-hmm. Okay. It's n no problem. No more time for problems. Okay. Uh uh. Mm mm mm. Slowly. Might. For the first application we will do with uh prototypes, it will be more about quantity uh of speech or how long, how much someone speak, but when we will go to final uh details, if I'm talking and someone say yes yes yes from time to time, it show that he's participating, and obviously it makes different that if someone is talking and one is replying never uh nor moving or nor having eyes open Okay. Go through them. Okay. No, I think uh I think we has already have plenty to read. Try. And the application you are using, all of them uh y every time you are using Matlab, or you are programming everything is based based on Matlab? Mm. Mm mm mm. Okay. I'm writing so only me can read. It's uh a kind of secret language. Okay.
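[Editorial sketch] The "quantity of speech" measure for the first prototype (how long each person has spoken, for example over the last five minutes) could look roughly like this. The segment format `(speaker, start, end)`, the function name `speaking_time` and the 300-second default window are assumptions made for illustration only.

```python
def speaking_time(segments, now, window=300.0):
    """Return seconds spoken per speaker within [now - window, now].

    Assumption: segments are (speaker, start, end) tuples in seconds and
    have already been attributed to a speaker or a table location.
    """
    totals = {}
    window_start = now - window
    for speaker, start, end in segments:
        # Keep only the part of the segment that falls inside the window.
        overlap = min(end, now) - max(start, window_start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    return totals


segments = [("A", 0, 100), ("A", 350, 400), ("B", 500, 520)]
print(speaking_time(segments, now=600))
```

Short replies such as "yes yes yes" add very little to this total, so a finer measure of participation would need something beyond duration, as noted in the discussion.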
So b maybe I'll go back to your questions. Um Yeah. Um Yeah. And this also um allows you to separate people. Um and the questions I did not talk about was uh going from location to uh identity of the person. Um That is quite different, you you need to use a signal you have separated and uh do the type of analysis I mentioned. Um from the spectrum you can transform the the magnitude spectrum and um um build model of the person. Um Yeah, pitch at least. But pitch is Pitch isn't is not enough sometimes. I it's a frequency, yeah, yeah. It's a frequency of your vibration here. Uh Yeah. And then it's transformed through the mouth, that's the usual model. Yeah. I yeah, um it's not necessarily done explicitly, but it's equivalent to that, yeah. So again, F_F_T_ is useful to do this type of analysis. Um Yeah, but it's quite personal. Uh Yeah. Yeah. Yeah, yeah. It it's an information equivalent to that. It's maybe n you don't need to do the whole uh complicated modelling, but No, no. Yeah, rate of speech is also possibility. Um how fast you speak, that's quite personal also. But it varies over time, um depending on emotion. That also might be interesting for you, detecting emotion. Yeah, yeah. And us yeah, yeah, so these measures are a way to quantify that. Uh p no. No, no it's not. What w w what is m what is more expensive is to um take the decision finally. So your decision can be for example uh who is it, or if the person moved, which is probably the most complicated thing, like the person goes away, then comes back ten minutes later, sits at a different place. Uh you will need to build statistical models of um the person identity using these measures. So the ones um he mentioned um I don't know if you're familiar with that. Yeah. Yeah. So um No. Yeah. S Yeah. All I was saying is that these different measures are a way to uh evaluate the identity of the person. 
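[Editorial sketch] As a rough illustration of reading a pitch-like frequency off the spectrum, the following picks the strongest spectral component in a typical voice range. A naive DFT is used here in place of an FFT so the example needs no external library; the function name, the 60 to 400 Hz search range and the synthetic test tone are all assumptions. On real speech the strongest peak is not always the fundamental, which fits the remark that pitch alone is sometimes not enough.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Return the frequency (Hz) of the strongest DFT bin in [f_min, f_max].

    Assumption: one short frame of mono audio; naive O(n^2) DFT stands in
    for the FFT mentioned in the discussion.
    """
    n = len(samples)
    best_freq, best_mag = 0.0, -1.0
    for k in range(1, n // 2):
        freq = k * sample_rate / n
        if not f_min <= freq <= f_max:
            continue
        # Magnitude of the k-th DFT coefficient.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        if abs(coeff) > best_mag:
            best_freq, best_mag = freq, abs(coeff)
    return best_freq


# Synthetic "voice": a 200 Hz fundamental plus a weaker 400 Hz harmonic.
rate, n = 8000, 400
tone = [math.sin(2 * math.pi * 200 * i / rate)
        + 0.5 * math.sin(2 * math.pi * 400 * i / rate) for i in range(n)]
print(dominant_frequency(tone, rate))
```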
So if with location all you can do is extract segments of speech, for example, but it will not tell you that these are the same person. It's just a location. Yeah. Yeah. Yeah, so that might be good for example for ten minutes, but then the person might move, or a different person might come. Yeah. It is. Well, unless sometimes somebody goes there or yeah. Yeah. Mm-hmm. Yeah. Yeah. Yeah. So it hmm. But that you can do with location, if it's only on the short term. On on the off-line yeah. I would say you might use these measures in a simple way if you do it uh on-line. Like um to give feedback. But uh you you mentioned that another side of your project is also to analyse off-line. So there you might want to use a We had a student here, he finished uh one or two years ago, but he left his software, and his software is precisely to do that to um cluster at uh another level, to cluster these small clusters of speech, uh group them by person automatically. Y no, from the these measures, pitch, etcetera. So um Yeah, exactly. So I would say from location at the lower level you can get small parts of speech, and then at the next level, possibly off-line, you can uh group them into um a speech segment. I'm hoping to have enough time to try um doing that before I finish my thesis. Um so if I manage to do some um uh practically usable software, I'll tell you. Uh in five, six months, abou approximately. More busy I would say. Yeah. So you said you want to do separation, um but ex exactly, yeah. Yeah. Okay. Okay. Ah, you want to do voice recognition. You might want. Okay. Okay. Yeah. So yeah, once you have locations it's not a big problem. Um Mm-hmm. No no, you should separate them first, yeah, yeah. So So the type of things I presented can lead to separation. Um it's only a very small step to add. Yeah, yeah. Although it might be, if you want to um ex extract these features like pitch and uh rate of speech, or energy simply, from each person. 
Um you might have to do some ba basic separation. 'Cause quite often people uh don't realise that they talk at the same time, um interrupt each other. So even if you don't want to recognise a speech fully, at least um you need yeah. It's okay. You can email later. Yeah. Um So Yeah. Uh Okay. A semantic context, yeah. Yeah. Yeah, then mm. Hmm. So it's uh Yeah, you could detect one word, yeah. Keyword spotting, yeah. So um Yeah. Yeah, I I'll just say uh if you want I can point to um three papers, um but no, I think it's better if I just send you the link, it's probably simpler. Uh one im is about this um sector based stuff, like uh the localisation and detection. One is about how to take the decision, that somebody's active or not. Um it sounds simple at first, you think you just put the threshold and that's it, but um the problem is uh the It no, you can do that. There's no problem, but i if you want your your system to work in different conditions, like cafeteria, library, um the environment will be quite different, so a single fixed threshold value might be a problem. S Yeah. So you can do that automatically, this kind of things. Um I don't know if you're doing that already, but uh Okay. So uh I have another paper which might be interesting for that for a single channel uh calibration. And I've code on-line, so for this particular one. And the last one I would say is um the clustering um to um yeah, the clustering of the different locations over time, so that you can get small clusters of speech. Um there are other ways to do it, but uh I think it's it's important because um in spontaneous speech all these words are quite small. So you need to do that uh adequately, yeah. But otherwise yeah, I'm done. Hmm. Yeah, yeah. Many small events basically, yeah. Yeah, basically uh I mean it depends uh Yeah, even if somebody says just yes. I don't know if it's important for you or not. It might be. Yeah. Yeah. Hmm. Yeah. Y Yeah. Yes. 
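[Editorial sketch] The remark that a single fixed threshold may fail when the environment changes (cafeteria versus library) can be illustrated by deriving the threshold from the recording itself. This is not the method of the papers mentioned, which are not described here; it is a simple stand-in that estimates a noise floor from the quietest frames. The names `adaptive_threshold` and `active_frames`, and the values `floor_fraction=0.2` and `margin=3.0`, are hypothetical.

```python
def adaptive_threshold(energies, floor_fraction=0.2, margin=3.0):
    """Threshold = margin x mean energy of the quietest frames.

    Assumption: at least some frames contain only background noise.
    """
    ordered = sorted(energies)
    count = max(1, int(len(ordered) * floor_fraction))
    noise_floor = sum(ordered[:count]) / count
    return noise_floor * margin


def active_frames(energies, **kwargs):
    """Flag frames whose energy exceeds the environment-specific threshold."""
    threshold = adaptive_threshold(energies, **kwargs)
    return [e > threshold for e in energies]


# Same speech pattern in a quiet room and in a room ten times louder.
quiet = [1.0, 1.2, 0.9, 1.1, 8.0, 9.0, 1.0, 1.0, 7.5, 1.1]
noisy = [e * 10 for e in quiet]
print(active_frames(quiet))
print(active_frames(noisy))
# A fixed threshold tuned for the quiet room marks every noisy frame active.
print(all(e > 5.0 for e in noisy))
```

The design choice is only that the threshold scales with the background level, so the same frames are flagged in both rooms.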
Um So I I'll send you links some of the papers are long, but you can just um briefly look, yeah. Um we can go and see Olivier, unless you want to talk about something else. Huh? Well, we can try. But uh We have yeah, we have O_C_R_, but maybe not on this computer. Yeah. That is good idea, yeah. It's a good idea. Yeah, yeah. Yeah. Once I I um coded the most expensive part in C_ just to see so you can have a mix of both. From Matlab you call C_ and back. But the day you want to do uh on-line real time stuff nah, I wouldn't wouldn't trust it.
